Though unintuitive at first, R offers a powerful way to visualize large data sets. With the right package, you can make charts and maps in R that look equally (if not better) than those you can make in ArcGIS or Illustrator. Most importantly, R is free and open-source and hence allows much smaller organisations (who cannot afford expensive ArcGIS or Adobe license) to leverage the power of GIS. More on ArcGIS and ESRI’s monopoly of GIS here. Whenever possible we should challenge the software monopoly of big corporations because:
It is bad for free market competition—without spending too much time on this, ArcGIS has many flaws as a spatial analysis product that it is not motivated to fix because there is no competition.
It silos public data in the hands of private corporations, creating a dependency on one organization. Why should ESRI get to monitise data collected by public entities using public tax dollars?
In this document, I hope to show you the basics of data visualizations in R so that you can keep making maps and graphs even after we loose educational ArcGIS and Adobe licenses.
ggplot is one of the most popular R data visualization packages. ggplot stands for “Grammar of Graphics” and is a way to use language (rather than point and click) to create charts. STHDA has a really good introduction to it here, but the basic of ggplot is to view it as a collection of layers.
According to ggplot2 concept, a plot can be divided into different fundamental parts : Plot = data + Aesthetics + Geometry.
The principal components of every plot can be defined as follow:
data is a data frame
Aesthetics is used to indicate x and y variables. It can also be used to control the color, the size or the shape of points, the height of bars, etc…..
Geometry corresponds to the type of graphics (histogram, box plot, line plot, density plot, dot plot, ….)
You really can do a lot with it! Lets go through the steps of creating a simple bar chart in ggplot.
First, lets set up our environment and call our data. When setting up your data for plotting, remember that the column names appear as the axis labels. Hence, when I set up my pivoted data, I try to use column names that would make the most sense.
#start by installing the packages
if(!require(pacman)){install.packages("pacman"); library(pacman)}
## Loading required package: pacman
p_load(tidyverse, ggplot2, sf, tigris, tidycensus)
#call your variables, here I am using commuting statistics from the ACS
commute_vars <- c(
total_workers = "B08006_001",
drive_alone = "B08006_003",
carpool = "B08006_004",
public_transit = "B08006_008",
walked = "B08006_015",
bicycle = "B08006_014",
worked_at_home = "B08006_017")
philly_commute.2022 <- get_acs(
geography = "county",
variables = commute_vars,
state = "PA",
county = "Philadelphia",
year = 2022,
survey = "acs5",
output = "wide")%>%
mutate( #create percentage classes for mapping
Drive = drive_aloneE/total_workersE * 100,
Carpool = carpoolE/total_workersE * 100,
Transit = public_transitE/total_workersE * 100,
Walk = walkedE/total_workersE * 100,
Bicycle = bicycleE/total_workersE * 100,
WFH = worked_at_homeE/total_workersE * 100
)%>%
select(Drive, Carpool, Transit, Walk, Bicycle, WFH) %>%
pivot_longer( #pivot to a cleaner look for plotting
cols = everything(),
names_to = "mode",
values_to = "percentage"
)
## Getting data from the 2018-2022 5-year ACS
Now that our data is set up, let us begin by creating a rough plot. ggplot works by layering objects on top of one another, so we start by specifying the data layer (here being bars of the bar chart).
R has more than 600 named colours that it recognizes. A list of all such colours can be found here.
Now that we have our base chart, we can begin our customizations. Lets start by geom_label (or geom_text) which adds text labels to our chart objects.
plot1+
geom_text(#sets the text layer on top of the plot
aes(label = paste0((round(percentage, 1)), "%")),#here, we want the y label to be percentage
#we use round to make it look pretty
nudge_y = 3, #we want the text to go up by 4 pixels
)
We have our text labels! Stylistically I prefer using the geom_label function instead of geom_text as it renders labels with a filled background. So lets replace that.
plot1 <- plot1 +
geom_label(
aes(label = paste0((round(percentage,1)), "%")),
size = 4, # how big we want it to be
nudge_y = 4, # move y by 4 px
fill = "#F6AE2D", #this sets the fill color
color = "white", #this sets the font color
alpha = 0.9) #make the label transparent
plot1
Now, lets add more text to our plot, i.e. change the labels using the labs() command.
plot1 <- plot1 +
labs( #set the labels for the chart
title = "How Philadelphia Commutes to Work",
subtitle = "Percentage by Commute Choice",
caption = "Data: ACS 2022",
x = NULL, # Remove x and y axis label since it is self explanatory
y = NULL
)
plot1
The next thing you will need to do is specify the themes, which are other stylistic elements of the chart. Themes are where you would adjust other elements of the chart such as gridlines, text positions etc. You can choose between preset R themes, specify your own thing, or do both! There are a lot of themes you can choose from.
plot1 +
theme_linedraw()
plot1 +
theme_classic()
Personally, I prefer theme_minimal, so that is what we will use for the plot.
plot1 <- plot1 +
theme_minimal()
plot1
I prefer using the minimal theme alongside customizing other elements (such as fonts) using the theme function. Using the packages showtext and sysfonts, you can import google fonts or any other system fonts downloaded on your device.
#first, lets call in our packages
p_load(showtext,sysfonts)
#now, lets call in our desired font from google fonts
#here I want to call IBM Plex Sans from google font
font_add_google("IBM Plex Sans") #calls the font
showtext_auto() #initializes it in the environment
plot1 +
theme(
text = element_text(family = "IBM Plex Sans", color = "#2F4858"),
#sets style for all text
plot.title = element_text(size = 16, face = "bold"), #adjust formatting of title
plot.subtitle = element_text(size = 12),
axis.text = element_text(size = 11, family = "IBM Plex Sans", color = "#2F4858"))
The theme command gives you quite a lot of freedom to change the elements in your plot! I would encourage you to explore it further. For now, I want to adjust my gridlines and other plot elements.
plot1 <- plot1 +
theme_minimal() + #sets a larger theme, google ggplot themes to see the options
theme(
text = element_text(family = "IBM Plex Sans", color = "#2F4858"), # set style for
#all text
plot.title = element_text(size = 16, face = "bold"), #adjust formatting of title
plot.subtitle = element_text(size = 12),
axis.text = element_text(size = 11, family = "IBM Plex Sans", color = "#2F4858"),
#set text style for axis text (which is different than plot text, idk why)
panel.grid.minor = element_blank(), #get rid of minor gridlines
plot.margin = margin(10, 10, 10, 10), # add padding
plot.caption = element_text(size = 8, margin = margin(t = 10)), #pad down caption
panel.grid.major.y = element_line(color = "#F6AE2D", size = 0.2,
linetype = "dotted"),
#have y grids
panel.grid.major.x = element_blank() #and get rid of x axis grids
)
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot1
And there is my plot! I can now export this chart to put in any document I see fit using the ggsave function. You can export it as a jpeg, a png, a pdf, or a svg to further edit in other software!
ggsave(plot = plot1,
"commuteplot1.png", #change the .png to .svg or .pdf as needed
bg = "transparent", #this will create a png with a transparent background
#remove if you want the fill color as you set it
width = 10, #width of the image
height = 5) #height of the image
A note of caution is that your saved plot might end up looking very different than your plot output in quarto. This happens because quarto will compress image outputs. Hence, you may need to play around with sizes of fonts and such in order to make your final plot look good.
Great, I hope you now understand what we mean when we say ggplot layers—its not too different than illustrator layers! Think of ggplot commands as steps you would need to do in illustrator—only here instead of mouse clicks, you memorize the steps instead. Lets move on to another example. As before, I call in my data and arrange it in the correct structure.
plot2_vars <- c(
total_workers = "B08006_001",
public_transit = "B08006_008",
medhhinc = "B19013_001"
)
plot2_data <- get_acs(
geography = "tract",
variables = plot2_vars,
state = "PA",
county = "Philadelphia",
year = 2022,
survey = "acs5",
output = "wide")%>%
mutate(
Transit = public_transitE/total_workersE * 100,
Income = medhhincE
)%>%
select(GEOID, Transit, Income)%>%
filter(!is.na(Income))
## Getting data from the 2018-2022 5-year ACS
For this plot, I want to use a new google font called Jost. I will not show how to call system fonts in this document as it is slightly complicated and I don’t know how you would do it on a windows device. However, if you are interested, you can read about it here.
I also call in the package scales, which is a nifty little tool to get various metrics (percentage, money) formatted to look pretty in your plot.
font_add_google("Jost")
showtext_auto()
p_load(scales)
Here, I am plotting a scatter plot. The scales package is especially useful when I am setting the labels of my x axis. Here, I use it shorten dollar amounts to the first three numbers and add the suffix k behind it.
plot2 <- ggplot(plot2_data,
aes(x = Income,
y = Transit))+
geom_point(
stat = "identity",
colour = "dodgerblue4",
alpha = 0.6,
size = 1.5
)+
scale_x_continuous(
labels = scales::label_dollar(scale = 0.001, suffix = "k"))+
#here, the scale = 0.001 basically means to multiply the number by 0.001
#recall decimals from elementary school! Multiplying 500,000 by 0.001
#will give us 500!
#then, I add k at the back to tie it to together.
scale_y_continuous(
limits = c(0, max(plot2_data$Transit)))
# i also set the limit of y axis as the max percentage value of transit
plot2
The rest of this section is about setting the theme and customisations. Here, I set the fill of the plot to a different colour. However, this means that I won’t be able to export it as a transparent png.
plot2+
theme_minimal()+
theme(
text = element_text(family = "Jost", color = "steelblue4"),
axis.text = element_text(size = 11, family = "IBM Plex Sans", color = "steelblue4"),
panel.grid.minor = element_blank(),
axis.title.x = element_text(vjust = -4),
axis.title.y = element_text(vjust = 4),
legend.title = element_text(size = 12),
plot.margin = margin(10, 10, 10, 10),
plot.caption = element_text(size = 8, margin = margin(t = 10)),
plot.title = element_text(size = 14, face = "bold"),
plot.background = element_rect(fill = "seashell", color = NA), #set background fill
panel.background = element_rect(fill = "seashell", color = NA)
#here, I have to set the fill of plot and panel separately because in ggplot
#they are two different things. Play around with changing the fill colour to see
#how it works
) +
labs(
title = "Less People Take Transit as Census Tracts Get Richer",
subtitle = "Median Income by Transit Commuters in Philadelphia",
caption = "Data: ACS 2022",
x = "Median Household Income",
y = "% of Transit Commuters")
Okay one last trick—you can also animate your graphs using ggplot! For this, let us call in the data on affordable housing production in Philadelphia from OpenDataPhilly. Here, rather than downloading the data manually, I tell R to download the data directly from the url. You can do this too—just right click the data you want, copy the clean link, and put it in place of my url. You may need to change the calling function for different file types—since I am downloading a geojson, I use the st_read which reads spatial files. If you were calling in a csv, you would want to use read.csv.
#load in the packages that we will need
p_load(gganimate, gifski)
affordablehousing <- st_read("https://hub.arcgis.com/api/v3/datasets/ca8944190b604b2aae7eff17c8dd9ef5_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1")
## Reading layer `Affordable_Housing' from data source
## `https://hub.arcgis.com/api/v3/datasets/ca8944190b604b2aae7eff17c8dd9ef5_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1'
## using driver `GeoJSON'
## replacing null geometries with empty geometries
## Simple feature collection with 443 features and 11 fields (with 25 geometries empty)
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -75.24604 ymin: 39.90183 xmax: -74.9829 ymax: 40.12171
## Geodetic CRS: WGS 84
plot3data <- affordablehousing %>%
mutate(PROJECT_TYPE = case_when(PROJECT_TYPE ==
"Rental;Special Needs" ~ "Special Needs",
TRUE ~ PROJECT_TYPE),
#simplifying one catagory for plotting
Year = as.numeric(FISCAL_YEAR_COMPLETE))%>%
group_by(Year, PROJECT_TYPE)%>%
summarize(total_units = sum(TOTAL_UNITS)) %>%
ungroup()%>%
filter(!is.na(Year) &
!is.na(total_units))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
#removing NA values
As before, we manipulate our data and get it ready to plot. We also follow similar steps as before to create our base plot, with two key command differences. In this chart, I used scale_fill_manual to manually set the colours of my categorial data.
plot3 <- ggplot(plot3data,
aes(x = PROJECT_TYPE,
y = total_units,
fill = PROJECT_TYPE)) +
geom_col(position = position_dodge(width = 0.9)) +
scale_fill_manual(values = c( #when you are working with catagorial data
"Rental" = "#2a9d8f",
"Special Needs" = "#e76f51",
"Homeownership" = "#f4a261"))+
geom_label(
aes(label = total_units),
size = 4,
nudge_y = 4,
fill = "#e9c46a",
color = "white",
alpha = 0.9) +
theme_minimal()+
theme(
text = element_text(color = "steelblue4"),
axis.text = element_text(size = 11, color = "steelblue4"),
legend.position = "none",
plot.title = element_text(size = 20, face = "bold"),
plot.subtitle = element_text(size = 12),
panel.grid.major.y = element_line(color = "steelblue4", size = 0.2,
linetype = "dotted"),
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank()) +
labs(
title = "What types of Affordable Housing Has Philly Been Building?",
subtitle = "Year: {closest_state}", #here, subtitle will be the closest state
#to the animated frame, more on states below
caption = "Data: Open Data Philly",
x = NULL,
y = "Total Units"
)
plot3
The above section however, just gives me the rough plot of my chart. If you print it, it looks very weird because all the unit values from various years are mapped on top of one another. To animate it, you would need to add the following sections.
plot3 <- plot3 +
#the segments that allow you to animate
transition_states(Year, #the variable which will define the frames of animation
#or "states". Here, the gif moves between various Years in ach frame
transition_length = 5, #the length of each transition
state_length = 4) + #how long the state should last
ease_aes('linear') #the style of animating
I will not go too much into the details of gganimate because it is vast and you really only get better at it through trial and error. You can find a really good cheatsheet here if you are curious to learn further.
In short, to animate, you will just need to specify the variable through which you want to transition (here defined as Year). Once you have your specifications, you can render your gif using the animate command.
#uncomment these sections when you run the markdown
#however, you will need to encode the gifs seperate to knit them
#to add gifs just write this in your file 
#plot3animate <- animate(plot3,
#nframes = 29, #the number of frames, here equal to the number of years in data
#fps = 1, #frame per second
#width = 800,
#height = 600)
#anim_save("plot3.gif", plot3animate)
You really can do a lot with gganimate! Here is another example, adapted from the gganimate documentation page here. In this plot, we will be showing change in life expectancy over GDP per capital in various countries. The dataset to do so is already loaded in R.
#calling a pre-existing dataset in R
p_load(gapminder)
plot4 <- ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, colour = country)) +
geom_point(alpha = 0.7, show.legend = FALSE) +
scale_colour_manual(values = country_colors) +
scale_size(range = c(2, 12)) +
scale_x_log10()+
facet_wrap(~continent) +
theme_minimal()+
theme(
text = element_text(color = "steelblue4"),
axis.text = element_text(size = 11, color = "steelblue4"),
panel.grid.major.y = element_line(color = "steelblue4", size = 0.2,
linetype = "dotted"),
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
axis.text.x = element_blank(),
)+
labs(title = "Does GDP per capita correlate with Life Expectancy",
subtitle = 'Year: {frame_time}',
x = 'GDP per capita',
y = 'Life expectancy') +
transition_time(year) +
ease_aes('linear')
#uncomment these sections when you run the markdown
#however, you will need to encode the gifs seperate to knit them
#to add gifs just write this in your file 
#plot4animate <- animate(plot4,
#width = 800,
#height = 600)
#anim_save("plot4.gif", plot4animate)
Okay, now lets try to plot some maps! To start, lets plot a simple chloropleth using the transit data we called previously. The package Tigris allows us to easily call shapefiles of various US political subdivisions and is a powerful tool when plotting US data.
philly <- tigris::tracts(state = "PA", county = "Philadelphia", year = 2022)
plot(philly[4])
A
cool thing about tigris is that it can help you automatically erase
sections of the map that would be covered by water. I find that having
water bodies erased gives you prettier maps. I also transform the plot
into a coordinate reference system more suitable for mapping
Philadelphia.
philly <- philly %>%
tigris::erase_water(area_threshold = 0.8)%>%
st_transform(2272)
plot(philly[4])
If
you notice, now my map has gaps in it to account for water bodies.
Another cool thing you can do with tigris is to call polygons of water
bodies in an area. Again, I find that this makes for a prettier plot. So
lets call the water of all the neighbouring counties and then crop the
shapefile to the extent of Philadelphia.
philly_water <- tigris::area_water(state = "PA", county = "Philadelphia", year = 2022)%>%
rbind(tigris::area_water(state = "PA", county = "Montgomery", year = 2022))%>%
rbind(tigris::area_water(state = "PA", county = "Delaware", year = 2022))%>%
rbind(tigris::area_water(state = "PA", county = "Bucks", year = 2022))%>%
rbind(tigris::area_water(state = "NJ", county = "Burlington", year = 2022))%>%
rbind(tigris::area_water(state = "NJ", county = "Camden", year = 2022))%>%
rbind(tigris::area_water(state = "NJ", county = "Gloucester", year = 2022))%>%
st_transform(2272)%>%
st_crop(st_bbox(st_buffer(philly, dist = 5000))) #buffer so it is philly and a bit more
ggplot() +
geom_sf(data = philly, fill = "seashell3", color = NA, alpha = 0.6)+
geom_sf(data = philly_water, fill = "dodgerblue4", color = "dodgerblue", alpha = 0.6)
We can also call up the shapefiles of PA and NJ that are adjacent to Philly. We don’t need any gridlines or variables in our data, so I specify the theme to be theme_void().
pa <- tigris::counties(state = "PA", year = 2022)%>%
st_transform(2272)%>%
st_crop(st_bbox(st_buffer(philly, dist = 5000)))%>%
erase_water(area_threshold = 0.9)
jersey <- tigris::states()%>%
filter(STUSPS == "NJ")%>%
st_transform(2272)%>%
st_crop(st_bbox(st_buffer(philly, dist = 5000)))%>%
erase_water(area_threshold = 0.8)
ggplot() +
geom_sf(data = pa, fill = "seashell3", col = "white", alpha = 0.9)+
geom_sf(data = jersey, fill = "seashell3", col = "white", alpha = 0.9)+
geom_sf(data = philly_water, fill = "dodgerblue4", col = "dodgerblue", alpha = 0.8) +
geom_sf(data = philly, fill = "seashell", color = NA, alpha = 0.6)+
theme_void()
Lets plot our data now. To start, lets join it with our philly census tract map and then add it as a layer.
philly <- philly %>%
left_join(plot2_data) %>%
mutate(Transit = case_when(
is.na(Transit) ~ 0,
TRUE ~ Transit
))
## Joining with `by = join_by(GEOID)`
plot5 <- ggplot() +
geom_sf(data = pa, fill = "seashell", col = "gray", alpha = 0.6)+
geom_sf(data = jersey, fill = "seashell", col = "gray", alpha = 0.6)+
geom_sf(data = philly_water, fill = "dodgerblue4", col = "dodgerblue", alpha = 0.8) +
geom_sf(data = philly,
aes(fill = Transit),
col = "transparent",
linewidth = 0.1)+
scale_fill_gradient( #helps set color for continuous variables
name = "% of Transit Commuters",
high = "maroon4",
low = "mistyrose")+
coord_sf( #specifies the limits of map as limits of Philly
xlim = c(st_bbox(philly)[1], st_bbox(philly)[3]),
ylim = c(st_bbox(philly)[2], st_bbox(philly)[4])
)+
theme_void()
plot5
Lets now work on our customisations.
plot5 +
labs(
title = "Where do Transit Commuters Live in Philadelphia",
caption = "Data: ACS 2022"
)+
theme(
text = element_text(family = "Jost", color = "steelblue4"),
plot.title = element_text(face = "bold", hjust = 0.5, size = 12),
legend.position = "bottom", #moves the legend
legend.key.width = unit(1, "cm"), #specifies its size
plot.margin = margin(10, 10, 10, 10),
legend.title = element_text(vjust = 1, hjust = 0.1)) #adjusts text
Lets try plotting point data of affordable housing next. To start, we would need to do some data manipulation. Annoyingly, point data is difficult to plot in ggplot if the X and Y coordinates are not stored in separate columns. Hence, to start, we would transform our affordable housing’s crs, and create two new columns that subset the longitude and latitude in geometry.
affordablehousing <- st_transform(affordablehousing, 2272)%>%
mutate(
X = st_coordinates(geometry)[,1],
Y = st_coordinates(geometry)[,2]
)
The steps that we would follow is quite similar to what has been outlines previously, except the introduction of a new function (guides). Guides helps you override aesthetic styling in the legend and make the individual circles in the legend bigger.
ggplot() +
geom_sf(data = pa, fill = "seashell3", col = "white", alpha = 0.9)+
geom_sf(data = jersey, fill = "seashell3", col = "white", alpha = 0.9)+
geom_sf(data = philly_water, fill = "dodgerblue4", col = "dodgerblue", alpha = 0.8) +
geom_sf(data = philly, fill = "seashell", color = NA, alpha = 0.6)+
geom_point(data = affordablehousing,
aes(x = X, y = Y, color = PROJECT_TYPE), #specify X and Y
size = 0.8, alpha = 0.9)+
scale_color_manual(values = c("Homeownership" = "#7fc97f", #create the color scale
"Rental" = "#beaed4",
"Special Needs" = "#fdc086"),
name = "Project Type") +
theme_void() +
labs(
title = "Affordable Housing in Philadelphia"
)+
theme(
text = element_text(family = "Jost", color = "steelblue4"),
plot.title = element_text(face = "bold", hjust = 0.5, size = 12, vjust = 0.5),
legend.title = element_text(hjust = 0.5),
legend.position = "bottom",
legend.key.size = unit(5, "mm"),
plot.margin = margin(10, 10, 10, 10))+
guides(color = guide_legend(override.aes = list(size = 3))) #override aesthetic styling
## Warning: Removed 25 rows containing missing values or values outside the scale range
## (`geom_point()`).
And there is our map!